Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures

OpenGVLab/Vision-RWKV

Vision-RWKV

The official implementation of "Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures".

News🚀🚀🚀

  • 2024/04/14: RWKV6 is now supported for the classification task, with higher accuracy (see the VRWKV6 rows in the Model Zoo).
  • 2024/03/04: We release the code and models of Vision-RWKV.

Highlights

  • High-Resolution Efficiency: Processes high-resolution images smoothly with a global receptive field.
  • Scalability: Pre-trained on large-scale datasets and scales up stably.
  • Superior Performance: Achieves better performance than ViTs in classification tasks. In dense prediction tasks, it surpasses window-based ViTs and is comparable to global-attention ViTs, with lower FLOPs and higher speed.
  • Efficient Alternative: Can serve as an alternative backbone to ViT across a comprehensive range of vision tasks.
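
To make the high-resolution efficiency point concrete, the sketch below counts patch tokens at several input resolutions and compares how a cost that is quadratic in the token count (as in global self-attention) and a cost that is linear in the token count (how the paper characterizes VRWKV's attention) grow relative to a 224x224 input. The patch size of 16, the helper names, and the pure N versus N^2 cost model are illustrative assumptions, not measurements from this repository.

```python
def num_tokens(image_size: int, patch_size: int = 16) -> int:
    """Number of non-overlapping square patches in a square image.

    patch_size=16 is an assumed value used only for illustration.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


def relative_cost(image_size: int, base_size: int = 224, patch_size: int = 16):
    """Return (linear, quadratic) cost growth relative to base_size.

    This is an idealized model: it ignores constants and every part of
    the network other than token mixing.
    """
    ratio = num_tokens(image_size, patch_size) / num_tokens(base_size, patch_size)
    return ratio, ratio ** 2


if __name__ == "__main__":
    for size in (224, 384, 512, 1024):
        linear, quadratic = relative_cost(size)
        print(
            f"{size:>5}px  tokens={num_tokens(size):>5}  "
            f"linear x{linear:.1f}  quadratic x{quadratic:.1f}"
        )
```

Under these assumptions, doubling the side length of the image quadruples the token count, so a linear-cost operator grows 4x while a quadratic-cost operator grows 16x.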

Overview

image

Schedule

  • Support RWKV6 as VRWKV6
  • Release VRWKV-L
  • Release VRWKV-T/S/B

Model Zoo

Pretrained Models

Model     Size   Pretrain       Download
VRWKV-L   192    ImageNet-22K   ckpt

Image Classification (ImageNet-1K)

Model      Size   #Param    #FLOPs    Top-1 Acc (%)   Download
VRWKV-T    224    6.2M      1.2G      75.1            ckpt | cfg
VRWKV-S    224    23.8M     4.6G      80.1            ckpt | cfg
VRWKV-B    224    93.7M     18.2G     82.0            ckpt | cfg
VRWKV-L    384    334.9M    189.5G    86.0            ckpt | cfg
VRWKV6-T   224    7.6M      1.6G      76.6            ckpt | cfg
VRWKV6-S   224    27.7M     5.6G      81.1            ckpt | cfg
VRWKV6-B   224    104.9M    20.9G     82.6            ckpt | cfg
  • VRWKV-L is pretrained on ImageNet-22K and then finetuned on ImageNet-1K.
  • VRWKV-L is trained with the InternImage codebase for faster training.
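
As a quick way to read the ImageNet-1K table, the sketch below computes what each VRWKV6 model gains in top-1 accuracy over the VRWKV model of the same scale, and what it costs in parameters and FLOPs. The numbers are copied from the table above; the `RESULTS` dictionary and the `compare` helper are hypothetical names introduced only for this illustration.

```python
# (params in M, FLOPs in G, top-1 accuracy in %), copied from the table above.
RESULTS = {
    "VRWKV-T": (6.2, 1.2, 75.1),
    "VRWKV-S": (23.8, 4.6, 80.1),
    "VRWKV-B": (93.7, 18.2, 82.0),
    "VRWKV6-T": (7.6, 1.6, 76.6),
    "VRWKV6-S": (27.7, 5.6, 81.1),
    "VRWKV6-B": (104.9, 20.9, 82.6),
}


def compare(scale: str) -> dict:
    """Compare VRWKV6-<scale> against VRWKV-<scale> (scale is T, S, or B)."""
    p0, f0, a0 = RESULTS[f"VRWKV-{scale}"]
    p1, f1, a1 = RESULTS[f"VRWKV6-{scale}"]
    return {
        "acc_gain": round(a1 - a0, 1),                        # percentage points
        "param_overhead_pct": round(100 * (p1 - p0) / p0, 1),
        "flops_overhead_pct": round(100 * (f1 - f0) / f0, 1),
    }


if __name__ == "__main__":
    for scale in ("T", "S", "B"):
        print(scale, compare(scale))
```

For the tiny models, for example, this works out to a gain of 1.5 points of top-1 accuracy for about a third more FLOPs; the accuracy gain shrinks at the larger scales.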

Object Detection with Mask-RCNN head (COCO)

Model     #Param    #FLOPs     box AP   mask AP   Download
VRWKV-T   8.4M      67.9G      41.7     38.0      ckpt | cfg
VRWKV-S   29.3M     189.9G     44.8     40.2      ckpt | cfg
VRWKV-B   106.6M    599.0G     46.8     41.7      ckpt | cfg
VRWKV-L   351.9M    1730.6G    50.6     44.9      ckpt | cfg
  • We report the #Param and #FLOPs of the backbone in this table.

Semantic Segmentation with UperNet head (ADE20K)

Model     #Param    #FLOPs    mIoU   Download
VRWKV-T   8.4M      16.6G     43.3   ckpt | cfg
VRWKV-S   29.3M     46.3G     47.2   ckpt | cfg
VRWKV-B   106.6M    146.0G    49.2   ckpt | cfg
VRWKV-L   351.9M    421.9G    53.5   ckpt | cfg
  • We report the #Param and #FLOPs of the backbone in this table.

Citation

If this work is helpful for your research, please consider citing the following BibTeX entry.

@article{duan2024vrwkv,
  title={Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures},
  author={Duan, Yuchen and Wang, Weiyun and Chen, Zhe and Zhu, Xizhou and Lu, Lewei and Lu, Tong and Qiao, Yu and Li, Hongsheng and Dai, Jifeng and Wang, Wenhai},
  journal={arXiv preprint arXiv:2403.02308},
  year={2024}
}

License

This repository is released under the Apache 2.0 license as found in the LICENSE file.

Acknowledgement

Vision-RWKV is built with reference to the code of the following projects: RWKV, MMPretrain, MMDetection, MMSegmentation, ViT-Adapter, InternImage. Thanks for their awesome work!